I am interested in education data and predicting the success of students in better ways hopefully to reduce the amount of testing (sampling) that is needed for each student, allowing them and their teachers to spend more time on preductive learning instead of time test taking or preparing for said tests. The data set I chose to explore is the “Regulatory Adjusted Four-Year Cohort Graduation Rates, School Year 2011-12” datset from EDFact, an initiative set forth by the U.S. Department of Education to “put performance data at the center of policy, management, and budget decisions for all K-12 educational programs”. [1]
The data has some values that need changing for my initial exploratory purposes. These values arise due to the need to protect student information under the Family Educational Rights and Privacy Act (FERPA). For instance, in order to maintain student anonymity, small samples (5 students or less) have their values omitted (replaced with “PS”). I will be ignoring these cells in calculations that use them. Also, medium sized samples (from 6 to 300 students) have varying degrees of “blurring”, where exact graduation percentages are not reported, but ranges are used instead. I replace these ranges with point estimates that are normally distributed, centered around the range’s midpoint, but never allowed to extend past the range’s maximum or minimum.
Additionally, since I’m only looking at the 2011-2012 school year, the 1112 identifier on many of the column names is unnecessary, so I rename those columns, removing that portion of the name. E.g., “ALL_COHORT_1112” becomes “ALL_COHORT”.
Note: All of the above changes were preprocessed and saved into the “graduation_rate_data.csv” file. To see how these calculations were performed, see my raw code file titled “P4_scratchpad.R”.
data <- read.csv("graduation_rate_data.csv")
#Some of the schools were resporting way more students than the others.
#I looked up the largest ever graduating class in the U.S.
#and it was 1561 students, so I removed all rows with cohorts over 1600.
#(https://en.wikipedia.org/wiki/Plano_East_Senior_High_School)
data <- data %>% filter(ALL_COHORT<1600)
I’m interested in what factors could be affecting the graduation rates, as well. So I found another data set[2] that has school data for pupil/teacher ratios and revenue data, which I suspect influence graduation rates.
school_districts <- read.csv("school_district_data.csv")
#Combine extra data from school_districts into main dataframe
data <- left_join(data, school_districts, by="leaid11")
data <- data %>% rename(subregion=County.Name)
#Add computed column for revenue per student since this is
#probably more predictive that total revenue
data$Total.General.Revenue.Per.Student <-
data$Total.General.Revenue/data$Total.Students
#Remove Infinities
data <- data %>%
mutate(Total.General.Revenue.Per.Student=
ifelse(Total.General.Revenue.Per.Student==Inf,
NA,
Total.General.Revenue.Per.Student))
#Remove schools that have per student revenues of over the .999 quantile.
data <- data %>%
filter(Total.General.Revenue.Per.Student <
quantile(Total.General.Revenue.Per.Student, .999, na.rm=TRUE))
#Remove Schools that have over 100 pupils per teacher. Either these values
#are mistaken, or these schools have atypical class environments and I don't
#want them affecting my analysis.
data <- data %>% filter(Pupil.Teacher.Ratio<=100)
My initial hypothesis about the data is that one of three variables (or some combination thereof) has an effect of the Graduation Rate for a given school. The three variables I’m hypothesising to have an effect are
In the analysis and exploration that follows, I look primarily at these variables (henceforth referred to as the hypothesis variables). Although I will periodically look at other aspects of the data.
The most important variable in my dataset is ALL_RATE, which is the percentage of students who graduated from each school (i.e., Graduation Rate). This is the most interesting variable because it’s the best way to operationalize success in terms of the U.S. education system teaching our young people. My intuition is that Number of Students Per School, Pupil/Teacher Ratio and Revenue Per Student are going to be contributing factors to the graduation rate, so I’m very interested in those, too.
The first thing I do is create histograms for these four variables. Additionally, I plot the histogram of the Revenue Per Student variable with a log tranformation on the x-axis to better observe that variable’s distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 73.00 86.00 77.94 92.76 100.00 1049
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 36.0 102.0 173.6 270.0 1550.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.89 13.21 15.18 15.66 17.52 86.10
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 9631 11330 13120 14290 129600
The Graduation Rate distribution is somewhat normal, though it’s obviously cut off at 100 since it’s impossible for a percentage to go over 100, and it has a long left tail.
The Number of Students Per School distribution appears to be exponential, which was surprising to me. I would have expected it to be normal as well, with only a few small schools, many average-sized schools, and a few large schools. However, it turns out that’s not the case. When the number of students is divided by the number of teachers to yield the Pupil Teacher Ratio, then the data does fit a normal distribution, though.
The Revenue Per Student distribution isn’t obvious from the linear histrogram, but transforming the x axis to a log scale reveals a close to normal distribution that is a little right skewed. So the Revenue Per Student distribution is approximately log-normal.
#Generate correlation matrix of hypothesis variables and Graduation Rate
data %>%
select(ALL_RATE,
ALL_COHORT,
Pupil.Teacher.Ratio,
Total.General.Revenue.Per.Student) %>%
rename(grad.rate=ALL_RATE,
num.students=ALL_COHORT,
pt.ratio=Pupil.Teacher.Ratio,
rev.per.stud=Total.General.Revenue.Per.Student) %>%
cor(use="na.or.complete")
## grad.rate num.students pt.ratio rev.per.stud
## grad.rate 1.00000000 0.2173667 -0.1186047 -0.03554802
## num.students 0.21736669 1.0000000 0.2879385 -0.10063393
## pt.ratio -0.11860466 0.2879385 1.0000000 -0.39486433
## rev.per.stud -0.03554802 -0.1006339 -0.3948643 1.00000000
The correlations between the three hypothesis variables and graduation rate are not strong (less than 0.3 in magnitude).
The strongest of the three is the Number of Students per School, which was not at all my first guess for strongest predictor. A possible explanation for this is that having a lot of peers increases the pressure for doing well in school. Additionally, larger schools are more likely to be in urban areas, so there could be a lot of other confounding factors that come along with that trend.
The Pupil/Teacher Ratio is weaker still, but at least the direction of the correlation is to be expected. It’s a negative correlation, so as the number of students per teacher goes up, the graduation rate tends to go down. However, this is a very weak relationship.
Finally, the last variable on the list was the one I expected to be the strongest predictor of graduation rate. Revenue Per Student has a paltry correlation constant of -0.035. There must be other factors at play here, because there is a significant negative correlation between Revenue Per Student and Pupil/Teacher Ratio. The two variables have a -0.372 correlation, so as Revenue Per Student goes up, there are fewer students for every teacher, and having fewer students per teacher is somewhat negatively correlated with higher graduation rates, but that chain of logic isn’t strong enough for there to be a correlation between Revenue Per Student and Graduation Rate.
Data exploration is about uncovering knowledge, and that knowledge is valuable if it confronts our preconceived notions about how something works. Usually, one thinks that this means showing a relationship between features that was previously undiscovered. However, it could just as easily be a demonstration that no relationship exists where “common sense” would lead one to believe a relationship exists. Based on these low correlation values, it appears that none of my hypothesis variables are strong predictors for Graduation Rate. For the remainder of my analysis, I will work on exploring this lack of a relationship.
Next, I want to do some regional comparisons. I start out with box plots of the four variables we’ve looked at so far by state.
It’s interesting to note that all of the outliers are below the mean. Given the Graduation Rate histrogram with its long left tail, this makes sense. Also, it’s impossible to have a Graduation Rate over 100%, so the higher a state’s average Graduation Rate, the less likely it is to have an outlier above the mean.
It makes sense that the more urban states have higher numbers of students for each graduating class. Other than that, it’s hard to tell much from this boxplot.
The variances for Pupil/Teacher Ratios are much lower than the other variables. While these means look low, it should be noted that Pupil/Teacher Ratio is not equivalent to class size. Class size is often 5-8 students higher. This is because not every teacher has a class every period.
The Revenue per Student box plot (which has a log y-axis), shows that there is a lot of data clustered around each state’s mean, since the boxes are all very narrow compared to the whiskers and outliers. Also, many of the states’ data is positively skewed.
Next, I took the three hypothesis variables and made a scatter plot for each agains Graduation Rate, along with a linear regression in red and a Generalized Additive Model (the default for geom_smooth when n>1000) in blue to better see the effect on Graduation Rate that each hypothesis variable has as it varies.
When plotting the Number of Students on the x axis and Graduation Rate on the y axis, we see a slight positive correlation, which matches the correlation coefficient we saw earlier. When we relax our model from being linear, we see that much of the increase occurs when the Number of Students is less than 100, and then it levels off a lot. Also, the banding that we see for cohort sizes under 200 is an effect of the rounding that occurs from the “blurring” effect applied to small schools discussed in “Loading and Cleaning The Data” section.
Looking at the scatterplot for Graduation Rate vs Pupil/Teacher Ratio, the relationship is even less clear. The linear model suggests a slight negative correlation, like we saw earlier, but the nonlinear model shows a complicated relationship, peaking around 15 students per teacher, and then decreasing as you move toward 20 students per teacher. Then there is a slight bump at about 22 teachers per student, but that’s probably because there are more schools at that level (likely due to state maximums for class size), which can be observed from the vertical banding at those levels. Also of note is the high variace around the model at the very low and very high Pupil/Teacher Rations. This is likely due to the fewer number of schools at these levels.
The linear model is the flattest for Revenue Per Student, which again matches the correlation coefficient we observed earlier, indicating that Revenue Per Student has the least predictive power of the three hypothesis variables. Looking at the nonlinear model, it’s even less clear what kind of relationship this hypothesis variable has with Graduation Rate. Most surprising to me is the drop off as you move right past about $22,000 per student.
While the boxplots and scatterplots lay out the data informatively, it’s harder for laypeople to glean any relationships from it because there is so much information in each plot. To make something that’s easier for the human eye to interpret, I want to map out the states geographically, since every U.S. citizen should be comfortable with that representation. First, I have to get each state’s latitude and longitude data. Then, I’ll group the school data by state. Finally, I’ll plot a heat map of each variable.
These heat maps are much more useful for comparing the four variables, especially for the untrained, because of their familiarity. Looking at the four maps, it’s again clear that the correlations between the three predictive variables and the Graduation Rates are less strong that I originally hypothesized.
For a more granular look, I will plot the same data on a county-by-county basis instead of by state so that the data isn’t as heavily grouped, especially in the larger states.
Even though these maps are more details, it’s even harder to eyeball correlations between them because the regions are so much smaller. There is definitely a need for a more encompassing, multivariate exploration.
While some relationships can be seen by comparing multiple plots above, I now plan to find some deeper relationships by combining multiple variables into single plots.
One aspect of the data that I’ve neglected thus far is the breakout of graduation rates among various groups of students:
ALL: All students combined
MAM: Native American students
MAS: Asian/Pacific Islander students
MBL: Black students
MHI: Hispanic students
MTR: Studens with two or more races
MWH: White students
CWD: Students with disabilities
ECD: Economically disadvantaged students
LEP: Students with limited English proficiency
To compare all of these cohorts with the overall graduation rate, I plotted a line graph connecting each state’s graduation rates for each cohort, ordered by the overall graduation rate (the black line).
Since these states are ordered by the overall graduation rate, I was looking for an upward or downward trend in any of the student cohorts to see if a particular student group either bolstered or undermined the total graduation rate. The strongest upward trend was for white students, but this alone doens’t mean that white students are a better indicator of a school’s success, since it’s likely that most schools are predominantly white, so I went on to graph the ratio of each cohort with the total student population to see if that was the case.
I start with the racial groups first, and then plot the other groups second.
In the race cohort plot, it’s clear that white students predominate, especially in Alaska, where erroneously there are more than 100% white students. But the gap between white students and other races doesn’t grow much as you go up in the rankings. Plus, states like California and Texas have good rankings and actually have more Hispanic students than White students.
In the other plot, while I was surprised to see the percentage of economically disadvantaged students across the nation as a whole, none of these three cohorts rise or fall that regularly as you go up in the states’ rankings.
The previous plots are a little hard to interpret, so I think a better way to demonstrate the racial make up of each state’s students is a stacked bar chart. I ordered the fills by the largest race in each particular state, and each fill represents the percentage of a particular race in a given state.
This representation really shows by how much White students are the predominant race, and the close relationship between white students’ graduation rates and total graduation rates is a result of this fact.
While these cohort differences are interesting, I’m going to now refocus back on my main hypothesis variables, namely, how is Graduation Rate affected by Number of Students per School, Pupil/Teacher Ratios, and Revenue per Student levels.
I tried a number of visualization combinations for Graduation Rate and the three hypothesis variables, but the data is clustered in such a way that I wasn’t able to show much. Here are some examples of those plots.
Due to the way my data is distributed, these two plots are very hard to interpret. This particular perspective has a lot of clustering, both spacially and chromatically (with many of the data points being on the dark end of the spectrum).
This plot colors each data point by state, but there’s too many colors to distinguish from, and the legend didn’t even fit on the plot, so I removed it altogether.
I’m starting to realize that the data is clustered in a way that makes multivariate analysis difficult. So I decide to do a few transformations that will make it easier to interpret.
This is already a nicer graph to look at, with a nice crescent shape. It also nicely shows how the three hypothesis variables relate to each other. It makes sense that the data exists in three quadrants. The empty, upper-right quadrant would be schools that spend a lot of money per student, yet still have very few teachers per student, which doesn’t seem like a likely scenario.
However, this graph is limited because it doens’t relate any of the variables to Graduation Rate, which as I’ve said is the most important variable to consider in this data set. Since I didn’t have much luck with all of the data as a whole, I decided to try something that worked pretty well earlier when I was plotting the heatmaps of the variables, and do a state by state comparison, singling out states one by one to suss out the relationships between the variables.
I start by ordering the states by Graduation Rate, so that I can look at the top and bottom states and show the relationship (or lackthereof) between that value and the other variables.
print.data.frame(
data %>% group_by(region) %>%
summarize(grad.rate=mean(ALL_RATE, na.rm=TRUE)) %>%
arrange(desc(grad.rate))
)
## region grad.rate
## 1 IOWA 88.26490
## 2 NEW HAMPSHIRE 87.42183
## 3 NEBRASKA 86.89516
## 4 NEW JERSEY 86.69148
## 5 VERMONT 86.57090
## 6 TENNESSEE 85.93794
## 7 MISSOURI 85.83430
## 8 KANSAS 85.59288
## 9 PENNSYLVANIA 85.56588
## 10 CONNECTICUT 85.50281
## 11 ARKANSAS 84.62547
## 12 INDIANA 84.25785
## 13 NEW YORK 84.14314
## 14 TEXAS 84.09530
## 15 MAINE 83.98745
## 16 WISCONSIN 83.59072
## 17 NORTH DAKOTA 83.30064
## 18 ILLINOIS 82.68784
## 19 SOUTH DAKOTA 82.49230
## 20 NORTH CAROLINA 82.31471
## 21 VIRGINIA 82.25354
## 22 MASSACHUSETTS 82.04639
## 23 MONTANA 81.03929
## 24 OHIO 80.73274
## 25 WEST VIRGINIA 80.59427
## 26 HAWAII 80.26880
## 27 CALIFORNIA 80.03340
## 28 RHODE ISLAND 77.21854
## 29 LOUISIANA 76.33099
## 30 MISSISSIPPI 75.77132
## 31 ALABAMA 75.55576
## 32 WYOMING 74.84923
## 33 MARYLAND 74.75619
## 34 UTAH 74.71588
## 35 SOUTH CAROLINA 73.21392
## 36 DELAWARE 72.14964
## 37 ARIZONA 70.70917
## 38 COLORADO 69.96055
## 39 MICHIGAN 69.42670
## 40 MINNESOTA 68.29072
## 41 NEW MEXICO 67.87550
## 42 OREGON 66.40423
## 43 GEORGIA 66.32037
## 44 WASHINGTON 65.63084
## 45 DISTRICT OF COLUMBIA 59.81687
## 46 NEVADA 59.58883
## 47 FLORIDA 56.63701
## 48 ALASKA 56.44120
Here are the graphs of the three hypothesis variables (Pupil/Teacher Ratio on the x axis, Revenue per Student on the y axis, and Number of Students per School for the size of each point) plotted by state. For each plot, the highlighted state is in blue, and all the other states’ data is in a salmon color with translucency added.
The first ten plots are the top ten states ranked by their Graduation Rate.
Nearly all of these schools have low Pupil/Teacher Ratios, whereas the Revenue per Student varies more. It’s notable that none of the three hypothesis variables are very consistent from state to state, which reinforces my narrative that none of these variables is an especially good predictor of Graduation Rate.
The next ten plots are the bottom ten states when ranked by Graduation Rate.
These states tend to have a higher Pupil/Teacher Ratio, but the Revenue per Student doesn’t have as strong of a trend. Some of these bottom states even look remarkingly similar to the top states. Take Alaska, which ranks last in Graduation Rate, and it’s in that low Pupil/Teach Ration and high Revenue per Student quartile, like many of the top states.
These lower ranked states have more within-state variation, which means that there are some schools bringing down the state average, but it’s not common that a majority of the schools in a state are all performing poorly.
I decided to test this hypothesis that lower ranking states have more variation in Graduation Rate on a school-to-school basis by ranking the states by this parameter.
state.grad_rate_variations <- data %>% group_by(region) %>%
summarize(grad.rate.variation=var(ALL_RATE, na.rm=TRUE),
grad.rate.mean=mean(ALL_RATE, na.rm=TRUE)) %>%
arrange(desc(grad.rate.variation))
print.data.frame(state.grad_rate_variations)
## region grad.rate.variation grad.rate.mean
## 1 FLORIDA 1085.89435 56.63701
## 2 MINNESOTA 911.69205 68.29072
## 3 MICHIGAN 905.07762 69.42670
## 4 WASHINGTON 792.13320 65.63084
## 5 ALASKA 770.84843 56.44120
## 6 NEVADA 745.72941 59.58883
## 7 DISTRICT OF COLUMBIA 730.99121 59.81687
## 8 ARIZONA 704.20157 70.70917
## 9 COLORADO 702.61280 69.96055
## 10 MARYLAND 666.78627 74.75619
## 11 OHIO 592.61754 80.73274
## 12 DELAWARE 581.63293 72.14964
## 13 OREGON 530.50594 66.40423
## 14 GEORGIA 525.60691 66.32037
## 15 CALIFORNIA 491.32554 80.03340
## 16 UTAH 459.22952 74.71588
## 17 NEW MEXICO 422.44132 67.87550
## 18 WISCONSIN 414.42036 83.59072
## 19 MASSACHUSETTS 406.11925 82.04639
## 20 ALABAMA 388.31329 75.55576
## 21 WYOMING 386.69483 74.84923
## 22 TEXAS 351.25127 84.09530
## 23 SOUTH CAROLINA 328.42452 73.21392
## 24 PENNSYLVANIA 266.59493 85.56588
## 25 INDIANA 256.41295 84.25785
## 26 SOUTH DAKOTA 254.44447 82.49230
## 27 RHODE ISLAND 241.98595 77.21854
## 28 NEW JERSEY 237.99696 86.69148
## 29 KANSAS 237.24395 85.59288
## 30 TENNESSEE 233.37246 85.93794
## 31 MONTANA 230.71152 81.03929
## 32 LOUISIANA 224.10395 76.33099
## 33 NEW YORK 218.09788 84.14314
## 34 MISSOURI 213.78781 85.83430
## 35 IOWA 212.05583 88.26490
## 36 CONNECTICUT 207.59165 85.50281
## 37 NORTH CAROLINA 200.38411 82.31471
## 38 ILLINOIS 193.71166 82.68784
## 39 HAWAII 190.31866 80.26880
## 40 NORTH DAKOTA 185.73402 83.30064
## 41 ARKANSAS 145.58466 84.62547
## 42 VIRGINIA 120.51495 82.25354
## 43 MISSISSIPPI 117.76825 75.77132
## 44 MAINE 108.90805 83.98745
## 45 NEBRASKA 98.58361 86.89516
## 46 NEW HAMPSHIRE 73.70071 87.42183
## 47 VERMONT 66.69807 86.57090
## 48 WEST VIRGINIA 61.00055 80.59427
with(state.grad_rate_variations, cor(grad.rate.variation, grad.rate.mean))
## [1] -0.829234
So the variation of Graduation Rate is very strongly negatively correlated with mean Graduation Rate. This makes a lot of sense because there’s a hard limit of a 100% Graduation Rate, so for schools to be at the top, most of their schools have to be near 100%. However, it does add some knowledge, because it could be the case that the states at the bottom have a lot of poorly performing schools, but this correlation indicates that those states actually have a mix of strong and weak schools.
Finally, I want to see how the state rankings by the hypothesis variables stack up agains the rankings by Graduation Rate. So I first compute those three new rankings, and then plot a slope plot for each to show how the rankings change. If any of the hypothesis variables were good predictors, then the slopes of the lines would be small, meaning that a ranking of the hypothesis variable is closer to the Graduation Rate ranking. That is, the states at the top of one ranking are in the top of the other ranking, and the same for the bottom states.
As you can see, the slopes of the lines in all three plots are large (giving the plot a very tangled appearance), which again supports the narrative that none of these three variables is a good predictor of Graduation Rate.
To clean up these slope plots, I decided to remove all of the states except the top 10 and bottom 10 in the first ranking of each plot. This shows the where the best and worst schools by each hypothesis variable end up in the Graduation Rate rankings.
Again, we see that the hypothesis variable rankings are not good indicators of where a state stacks up when ranked by Graduation Rate.
For my final visualizations, I picked 3 plots that show the relationship between Revenue per Student and Graduation Rate. They all fall on different parts of the “Information Density” versus “Accessibility” spectrum. The plots will go in order of increasing information density, meaning we’ll start with the most accessible.
One would expect a state’s ranking for ‘Revenue Per Student’ to be a strong indicator of that state’s ranking for ‘Graduation Rate’, but looking at this graph, it’s clear that the variation between the two rankings is very high, with the top 10 and bottom 10 schools in revenue get wildly different rankings for their graduation rate.
By showing the Graduation Rates and Revenues per Student for each state, one can see that the states with high graduation rates don’t necessarily have high Revenues per Student and vice versa, further illustrating the lack of correlation between the two variables.
This scatterplot shows the most information of these three plots. While there is a hint of a positive relationship between the two variables, the regression model shows that it is a tenuous relationship at best. In the upper and lower ranges for Revenue per Student, there aren’t many samples, so the variance is large. For the most dense section of the data, there isn’t a clear monotonic relationship, since the regression curve peaks and dips in several spots.
Combined, I believe these three plots demonstrate the lack of correlation between Graduation Rate and Revenue per Student. In my time working on this data exploration, I’ve shared this surprising finding with some of my friends, most of whom were skeptical of my claim since intuition suggests a relationship would exist. I would think that were I to show them these plots, they would be more convinced that the lack of a relationship between the two variables is, in deed, true.
At every step of my analysis of the hypothesis variables and their relationship to Graduation Rates, I became more convinced of the lack of influence that they had on Graduation Rates. Even though my initial intuition was that they would be very important, the data did not show this relationship. The best data analysis confronts our incorrect assumptions, so in that sense, my analysis was very successful.
I found it very difficult to visualize the lack of a relationship as opposed to visualizing the presence of a relationship, but I think that, even though not all of my attempts were successful, my analysis as a whole showed this lack of relationship.
I think the most successful graphs were the scatterplots with regression models since they showed each hypothesis variable and its complicated relationship with Graduation Rate.
Also, I was really happy with my multivariate analysis where I highlighted each state one by one. These plots combined all of my hypothesis variables into one graph per state, and comparing these plots showed how you couldn’t really make a guess at how the state would rank with respect to its average Graduation Rate just giving its distribution on that plot. One thing I would do if I continued work on this project would be to make an interactive version of these plots, allowing a user to select which state to highlight.
Additionally, if I had more time to devote ot this analysis, I would like to examine state by state education laws that might account for some of the difference I observed. For example, some schools have incentives for schools to keep their class sizes under a certain threshold, which would affect that state’s Pupil/Teacher Ratio.
Also, there are many other features of each school that might be better predictors than the three hypothesis variables I chose, so I would love to look at those other features to see if some combination of them explained why a particular school did well at getting students to graduate.
[2] https://nces.ed.gov/ccd/elsi/tableGenerator.aspx
[3] Udergraduate Economic Review Paper http://digitalcommons.iwu.edu/cgi/viewcontent.cgi?article=1047&context=uer
[4] U.S. Class Size Policy http://www.ecs.org/clearinghouse/81/95/8195.pdf
[5] Slopegraph Code https://acaird.github.io/computers/r/2013/11/27/slopegraphs-ggplot/